We Claim:

1. A method for determining a degree of similarity between documents, the method 
comprising the steps of: 

storing, for at least two documents, labeled tree representations of 
respective documents; 

storing, for at least two documents, path representations relating to paths 
that occur in the documents from root nodes to leaf nodes in the labeled tree 
representations of the respective documents; and 

calculating a measure of similarity between two of the documents based 
upon the frequency of occurrence of similar paths specified by the path 
representations. 

2. The method as claimed in claim 1, wherein the tree representation is a Document 
Object Model (DOM) representation. 

3. The method as claimed in claim 1, further comprising the step of generating a 
path representation for a path of a document as a sequence of labels 
representative from a root node to a leaf node in the labeled tree representation of 
the document. 
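
By way of illustration only, the path generation recited in claim 3 might be sketched as follows. The sketch assumes the labeled tree is an XML element tree parsed with Python's standard `xml.etree.ElementTree` module, that a node's label is its tag name, and that labels are joined with `/`; the function name `root_to_leaf_paths` is hypothetical and not taken from the claims.

```python
import xml.etree.ElementTree as ET

def root_to_leaf_paths(element, prefix=()):
    """Yield each root-to-leaf path as a tuple of labels (tag names)."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        # A node with no children is a leaf: the path is complete.
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

# Assumed toy document; any well-formed XML string would do.
doc = ET.fromstring("<html><body><p/><p/><div><a/></div></body></html>")
paths = ["/".join(p) for p in root_to_leaf_paths(doc)]
print(paths)
```

Repeated paths are kept here, since the later claims rely on how often each path occurs in a document.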

4. The method as claimed in claim 1, further comprising the step of storing, as path 
representations, sets of sequenced labels representative of distinct paths in a 
labeled tree representation of a corresponding document. 

5. The method as claimed in claim 4, further comprising the step of storing a path 
dictionary ($Dict_{paths} = \{p_1, p_2, \ldots, p_n\}$) of distinct paths collated from a tree 
representation for a document. 

6. The method as claimed in claim 5, further comprising the step of eliminating 
selected paths from the path dictionary ($Dict_{paths}$). 

7. The method as claimed in claim 6, wherein paths that occur highly frequently or 
highly infrequently are eliminated from the path dictionary ($Dict_{paths}$). 
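
As a non-limiting illustration of claims 5 to 7, the path dictionary could be collated and pruned as sketched below. The claims do not fix how "highly frequently" or "highly infrequently" is measured; this sketch assumes document frequency across a collection, and the thresholds `min_df` and `max_df_ratio` and the function name are hypothetical.

```python
from collections import Counter

def prune_path_dictionary(docs_paths, min_df=2, max_df_ratio=0.9):
    """Return the sorted distinct paths that survive frequency pruning.

    docs_paths: one list of path strings per document.
    min_df: assumed lower bound on the number of documents containing a path.
    max_df_ratio: assumed upper bound on the fraction of documents containing it.
    """
    n_docs = len(docs_paths)
    doc_freq = Counter()
    for paths in docs_paths:
        doc_freq.update(set(paths))  # count each path once per document
    return sorted(
        p for p, c in doc_freq.items()
        if c >= min_df and c / n_docs <= max_df_ratio
    )

collection = [["a/b", "a/c"], ["a/b", "a/d"], ["a/b", "a/c"]]
dict_paths = prune_path_dictionary(collection)
print(dict_paths)
```

In this toy collection `a/b` appears in every document and `a/d` in only one, so both are eliminated under the assumed thresholds.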

8. The method as claimed in claim 7, further comprising the step of computing the 
frequency of occurrence ($f_j(p_i)$) of a path ($p_i$) in a document ($d_j$). 

9. The method as claimed in claim 8, further comprising the step of computing the 
maximum number of instances ($f_{max} = \max_i f_j(p_i)$) in which a path ($p_i$) in the 
document ($d_j$) occurs. 

10. The method as claimed in claim 9, further comprising the step of storing a 
representation of the document ($d_j$) as an N-dimensional vector ($[d_{j1}, d_{j2}, \ldots, d_{jN}]$, 
where $d_{jk} = f_j(p_k)/f_{max}$, $1 \le k \le N$) of relative frequencies of occurrence ($f_j(p_k)$) of 
paths ($p_k$) in the document ($d_j$). 
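
Purely as an illustrative sketch of claims 8 to 10, the vector of relative frequencies might be computed as below. The sketch assumes the maximum $f_{max}$ is taken over the paths retained in the dictionary and that a document containing none of those paths maps to the zero vector; neither choice, nor the name `path_vector`, comes from the claims.

```python
from collections import Counter

def path_vector(doc_paths, dictionary):
    """Map a document's paths to [f_j(p_k) / f_max for each p_k in dictionary]."""
    counts = Counter(doc_paths)  # f_j(p): occurrences of each path in the document
    f_max = max((counts[p] for p in dictionary), default=0)
    if f_max == 0:
        # Assumed convention: no dictionary path occurs, so return all zeros.
        return [0.0] * len(dictionary)
    return [counts[p] / f_max for p in dictionary]

dictionary = ["html/body/p", "html/body/div/a", "html/body/ul/li"]
d_j = path_vector(["html/body/p", "html/body/p", "html/body/div/a"], dictionary)
print(d_j)
```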

11. The method as claimed in claim 8, further comprising the step of computing the 
minimum number of instances ($f_{min} = \min_i f_j(p_i)$) in which a path ($p_i$) in the 
document ($d_j$) occurs. 

12. The method as claimed in claim 10, further comprising the step of computing the 
similarity between a pair of documents ($d_i$, $d_l$) as a function ($sim(d_i, d_l)$) of 
metrics relating the number of paths common to the respective documents ($d_i$, 
$d_l$). 

13. The method as claimed in claim 12, wherein the function for computing the 
similarity between a pair of documents ($d_i$, $d_l$) 

$$sim(d_i, d_l) = \frac{\sum_{k=1}^{N} \min(d_{ik}, d_{lk})}{\sum_{k=1}^{N} \max(d_{ik}, d_{lk})}$$

is the quotient of a numerator, defined as the sum for all paths (k = 1 ... N) of the 
minimum number of instances ($\min(d_{ik}, d_{lk})$) in which paths occur in the 
respective documents ($d_i$, $d_l$), and a denominator, defined as the sum for all paths 
(k = 1 ... N) of the maximum number of instances ($\max(d_{ik}, d_{lk})$) in which paths 
occur in the respective documents ($d_i$, $d_l$). 
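
The quotient recited in claim 13 can be illustrated with the short sketch below, given only as an example. It assumes both documents are already represented as equal-length vectors of relative path frequencies, and it returns 0.0 when the denominator is zero, a convention the claims do not specify; the function name `similarity` is hypothetical.

```python
def similarity(d_i, d_l):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    if len(d_i) != len(d_l):
        raise ValueError("both vectors must have the same dimension N")
    numerator = sum(min(a, b) for a, b in zip(d_i, d_l))
    denominator = sum(max(a, b) for a, b in zip(d_i, d_l))
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return numerator / denominator if denominator else 0.0

# Minima sum to 0.5 + 0.5 + 0.0 = 1.0; maxima sum to 1.0 + 0.5 + 1.0 = 2.5.
score = similarity([1.0, 0.5, 0.0], [0.5, 0.5, 1.0])
print(score)
```

With non-negative components the result lies between 0 and 1, and identical non-zero vectors give 1.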

14. The method as claimed in claim 1, wherein the tree representation of a document 
includes a positional index, which represents, for a node (n), the number of 
previous sibling nodes with the same label as that of node (n). 
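
The positional index of claim 14 might, for example, be attached to each path step as sketched below. The sketch assumes an XML element tree parsed with Python's standard `xml.etree.ElementTree` module and writes each step as `label[index]`; that notation and the function name `indexed_paths` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def indexed_paths(element, index=0, prefix=""):
    """Yield root-to-leaf paths whose steps carry a positional index.

    The index of a node is the number of previous siblings with the same label.
    """
    step = f"{prefix}/{element.tag}[{index}]"
    children = list(element)
    if not children:
        yield step
        return
    seen = Counter()  # same-label siblings already visited under this parent
    for child in children:
        yield from indexed_paths(child, seen[child.tag], step)
        seen[child.tag] += 1

doc = ET.fromstring("<html><body><p/><div/><p/></body></html>")
precise_paths = list(indexed_paths(doc))
print(precise_paths)
```

The second `p` element receives index 1 because one earlier sibling shares its label, while the intervening `div` does not affect the count.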

15. The method as claimed in claim 14, further comprising the step of storing as a 
path representation a set that defines positional information of sibling nodes 
under a parent node. 

16. The method as claimed in claim 15, further comprising the step of storing precise 
path representations that precisely define a document structure, and generalised 
path representations that partially generalise structural aspects of precise path 
representations of a document. 

17. The method as claimed in claim 16, wherein the step of calculating the measure 
of similarity involves determining a total number of precise path representations 
of one document that are either shared by the other document, or are a subsumed 
subset of at least one of the generalised path representations of the other 
document. 

18. The method as claimed in claim 17, further comprising the step of normalising 
the measure of similarity by a term that represents the number of unique path 
representations shared by the two documents. 

19. The method as claimed in claim 18, wherein the number of unique path 
representations is calculated by adding the number of path representations for 
each document, and subtracting from this total the number of path representations 
shared by the two documents. 
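
As a simplified illustration of the normalisation in claims 18 and 19, the sketch below counts only exactly shared path representations and ignores subsumption by generalised path representations, which claim 17 also admits. That simplification, the zero result for two empty documents, and the function name are assumptions.

```python
def normalised_shared_paths(paths_a, paths_b):
    """Shared paths divided by (paths in A + paths in B - shared paths)."""
    a, b = set(paths_a), set(paths_b)
    shared = len(a & b)
    unique = len(a) + len(b) - shared  # number of unique paths, per claim 19
    return shared / unique if unique else 0.0

doc_a = ["/html[0]/body[0]/p[0]", "/html[0]/body[0]/p[1]", "/html[0]/body[0]/div[0]"]
doc_b = ["/html[0]/body[0]/p[0]", "/html[0]/body[0]/ul[0]"]
ratio = normalised_shared_paths(doc_a, doc_b)
print(ratio)
```

Here one path is shared and 3 + 2 - 1 = 4 paths are unique, so the normalised measure is 0.25.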

20. The method as claimed in claim 14, further comprising the step of storing as a 
path representation a sequence of terms separated by a delimiting symbol, in 
which each term is represented by a label and a parenthesised predicate that 
specifies the positional index of the term either specifically or generally. 

21. Computer software, recorded on a medium, for determining a degree of 
similarity between documents, the computer software comprising: 

software code means for storing, for at least two documents, labeled tree 
representations of respective documents; 

software code means for storing, for at least two documents, path 
representations relating to paths that occur in the documents from root nodes to 
leaf nodes in the labeled tree representations of the respective documents; and 

software code means for calculating a measure of similarity between 
two of the documents based upon the frequency of occurrence of similar paths 
specified by the path representations. 

22. A computer system for determining a degree of similarity between documents, 
the computer system comprising: 

means for storing, for at least two documents, labeled tree 
representations of respective documents; 

means for storing, for at least two documents, path representations 
relating to paths that occur in the documents from root nodes to leaf nodes in the 
labeled tree representations of the respective documents; and 

means for calculating a measure of similarity between two of the 
documents based upon the frequency of occurrence of similar paths specified by 
the path representations. 